X STUDYID USUBJID CNSR TRT01P PPROTFL STARTDT ADT AVAL
1 1 54767414MMY3008 SUBJ0001 0 TREAT A Y 2019-04-22 2021-01-12 631
2 2 54767414MMY3008 SUBJ0002 0 TREAT A Y 2021-10-13 2024-04-04 904
3 3 54767414MMY3008 SUBJ0003 0 TREAT B Y 2020-10-31 2020-11-30 30
4 4 54767414MMY3008 SUBJ0004 1 TREAT A Y 2020-12-31 2021-09-25 268
5 5 54767414MMY3008 SUBJ0005 1 TREAT A Y 2017-10-26 2018-07-13 260
6 6 54767414MMY3008 SUBJ0006 0 TREAT B Y 2017-02-02 2017-12-13 314
EVNTDESC RSSTRESC TOTAE SAECOUNT AEDECOD_MS
1 DEATH Stable Disease 1 0 Nausea
2 DEATH Progressive Disease 2 0 Fatigue
3 DEATH Partial Response NA NA <NA>
4 ANALYSIS CUT OFF Progressive Disease 4 1 Dizziness
5 LAST CONTACT Stable Disease 4 1 Fatigue
6 DEATH Complete Response 2 0 Headache
AEBODSYS AESER AGE SEX RACE ETHNIC LDH
1 Gastrointestinal N 70.51563 Female White Not Hispanic or Latino 139.7219
2 Nervous System N 79.40260 Female White Not Hispanic or Latino 115.6110
3 <NA> <NA> 76.33235 Male White Not Hispanic or Latino 172.8269
4 Gastrointestinal N 77.44750 Male White Not Hispanic or Latino 144.8748
5 Nervous System N 72.06439 Female White Not Hispanic or Latino 161.9881
6 General Disorders N 65.23182 Male White Not Hispanic or Latino 175.1388
ALB HGB EXTRT EXDOSE BMI ECOG CRP
1 155.2226 135.7905 Dexamethasone 110 26.83260 4 12.830053
2 128.2108 131.7716 Daratumumab 30 25.42065 1 6.702462
3 159.3242 142.5922 Dexamethasone 60 32.02936 2 11.959545
4 171.4453 129.9159 Lenalidomide 150 22.04537 1 2.101916
5 131.3127 164.3331 Daratumumab 130 20.38940 1 7.230141
6 141.2525 159.1366 Lenalidomide 20 21.33275 1 6.671318
First we want to understand the struture of the data and the data type of each variables
We are also interested in the other categorical variables, such as demographic information and Adverse Event related variables.
Code
categorical_columns <-c("SEX", "RACE", "ETHNIC","EXTRT", "RSSTRESC", "AEDECOD_MS", "AEBODSYS", "AESER", "TRT01P")for (col in categorical_columns) {# Create a bar plot p <-ggplot(data, aes_string(x = col)) +geom_bar(fill ="steelblue", color ="black") +geom_text(stat ="count", aes(label = ..count..), vjust =-0.5, size =3) +labs(title =paste("Bar Plot of", col),x = col,y ="Count" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) # Rotate x-axis labels if needed# Print the plotprint(p)}
One thing to notice is that the majority of the subjects are white, and most of the subjects don’t have serious Adverse Event. Other than these, other categorical variables are balanced.
Now we want to take a look at the numerical variablesWe first want to know if there are outliers in each numerical variables
Code
numerical_columns <-c("LDH", "ALB", "HGB", "EXDOSE", "BMI", "ECOG", "CRP", "TOTAE", "SAECOUNT", "AGE")for (col in numerical_columns) { p <-ggplot(data, aes_string(y = col)) +geom_boxplot(fill ="lightblue", color ="black", outlier.colour ="red") +labs(title =paste("Boxplot of", col), y = col) +theme_minimal()print(p)}
Next we want to check the distribution of the numerical variables
Code
for (col in numerical_columns) { p <-ggplot(data, aes_string(x = col)) +geom_density(fill ="steelblue", alpha =0.6) +labs(title =paste("Density Plot of", col), x = col, y ="Density") +theme_minimal()print(p)}
LDH, ALB and HGB look similar, we can make a reasonable assumption that they are correlated. We can see that Age, MBI and CRP are all nearly normally distributed. For ECOG, TOTAE and SAECOUNT, since their values are all integer, their distribution plot will seem a little weird, but it is within our expectation.
Now we are curious about the statistical summary of these numerical columns.
LDH ALB HGB EXDOSE
Min. : 3.383 Min. : 2.986 Min. : 2.793 Min. : 10.0
1st Qu.: 4.868 1st Qu.: 4.900 1st Qu.: 4.855 1st Qu.: 50.0
Median : 12.855 Median : 13.049 Median : 12.953 Median :110.0
Mean : 56.224 Mean : 56.568 Mean : 54.914 Mean :105.1
3rd Qu.:137.719 3rd Qu.:138.617 3rd Qu.:133.987 3rd Qu.:150.0
Max. :211.675 Max. :194.763 Max. :224.314 Max. :200.0
BMI ECOG CRP TOTAE
Min. :10.77 Min. :0.000 Min. : 1.843 Min. :1.000
1st Qu.:21.92 1st Qu.:1.000 1st Qu.: 7.784 1st Qu.:1.000
Median :25.60 Median :2.000 Median : 9.767 Median :2.000
Mean :25.35 Mean :2.059 Mean : 9.807 Mean :2.462
3rd Qu.:28.94 3rd Qu.:3.000 3rd Qu.:11.812 3rd Qu.:3.000
Max. :42.26 Max. :4.000 Max. :18.781 Max. :8.000
NA's :88
SAECOUNT AGE
Min. :0.0000 Min. :57.35
1st Qu.:0.0000 1st Qu.:69.80
Median :0.0000 Median :73.25
Mean :0.3496 Mean :73.06
3rd Qu.:1.0000 3rd Qu.:76.42
Max. :3.0000 Max. :87.47
NA's :88
Furthermore we are curious about the correlation between these numerical vairables
Code
library(reshape2)library(tidyr)# Calculate the correlation matrix for the numerical columnscor_matrix <-cor(data[numerical_columns], use ="complete.obs", method ="pearson")cor_matrix_melted <-melt(cor_matrix)# Create the heatmap using ggplot2ggplot(cor_matrix_melted, aes(Var1, Var2, fill = value)) +geom_tile() +scale_fill_gradient2(midpoint =0, low ="red", high ="green", mid ="white") +theme_minimal() +labs(title ="Correlation Heatmap", x ="Variables", y ="Variables") +theme(axis.text.x =element_text(angle =45, hjust =1))
We can see that only LDH ALB and HGB are correlated to each other, and we may consider choosing only one of these biomarker in our analysis
We now have a decent understanding of this data, now we can move to data preprocess and feature selections for our models.